
Introduction #

Social media has become a key platform for discussions on global issues, and the Covid-19 pandemic was no exception. Millions of users shared their opinions on Twitter regarding Covid-19 vaccines, ranging from strong approval to skepticism and misinformation. To understand these opinions better, I conducted a Twitter Covid Vaccine Sentiment Analysis using Natural Language Processing (NLP) for an assignment during my bachelor’s degree.

This project aimed to explore how the public reacted to Covid-19 vaccines over time, which vaccines were more favored, and how misinformation shaped the discussion. In this post, I’ll walk you through the data collection process, the sentiment analysis techniques, and the key insights obtained from over 614,000 tweets related to Covid-19 vaccines.

Data Collection & Preprocessing #

1. Data Source #

The data came from Kaggle and consisted of tweets about Covid-19 vaccines collected by different users. Two datasets were used:

Covid Vaccine Tweets
COVID-19 All Vaccines Tweets

These datasets were merged, resulting in a final dataset of 614,074 tweets spanning from January 2020 to April 2022. The dataset provided an extensive snapshot of public sentiment throughout different stages of the pandemic, including vaccine development, approvals, and rollouts.

	id	user_name	user_location	user_description	user_created	user_followers	user_friends	user_favourites	user_verified	date	text	hashtags	source	retweets	favorites	is_retweet
0	1340539111971516416	Rachel Roh	La Crescenta-Montrose, CA	Aggregator of Asian American news; scanning diverse sources 24/7/365. RT's, Follows and 'Likes' will fuel me 👩‍💻	2009-04-08 17:52:46	405	1692	3247	False	2020-12-20 06:06:44	Same folks said daikon paste could treat a cytokine storm #PfizerBioNTech https://t.co/xeHhIMg1kF	['PfizerBioNTech']	Twitter for Android	0	0	False
1	1338158543359250433	Albert Fong	San Francisco, CA	Marketing dude, tech geek, heavy metal & '80s music junkie. Fascinated by meteorology and all things in the cloud. Opinions are my own.	2009-09-21 15:27:30	834	666	178	False	2020-12-13 16:27:13	While the world has been on the wrong side of history this year, hopefully, the biggest vaccination effort we've ev… https://t.co/dlCHrZjkhm	NaN	Twitter Web App	1	1	False
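The merge can be sketched with pandas, assuming both Kaggle files share the columns shown above. This is only an illustration: the helper name `merge_tweet_datasets`, the file name in the comment, and the tiny inline frames are hypothetical stand-ins, not the original data or code.

```python
import pandas as pd

def merge_tweet_datasets(frames):
    """Stack several tweet DataFrames that share the same columns."""
    merged = pd.concat(frames, ignore_index=True)
    # Parse timestamps so later steps can group tweets by month.
    merged["date"] = pd.to_datetime(merged["date"], errors="coerce")
    return merged

# Hypothetical stand-ins for the two Kaggle CSVs; in practice each frame
# would come from something like pd.read_csv("vaccination_tweets.csv").
first = pd.DataFrame({
    "id": [1, 2],
    "date": ["2020-12-20 06:06:44", "2020-12-13 16:27:13"],
    "text": ["tweet a", "tweet b"],
})
second = pd.DataFrame({
    "id": [3, 4],
    "date": ["2021-01-05 09:00:00", "2021-08-14 12:30:00"],
    "text": ["tweet c", "tweet d"],
})

merged = merge_tweet_datasets([first, second])
print(merged.shape)
```

Duplicates are deliberately left in at this stage, since they are handled during preprocessing.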

2. Preprocessing Steps #

Before applying sentiment analysis, the data underwent extensive cleaning and transformation to remove noise and standardize text for analysis. The following steps were implemented:

Eliminating Duplicate Tweets and Bot-Generated Content: To avoid skewing results.
Removing URLs, mentions, and hashtags to focus only on the textual content.
Tokenization: Splitting sentences into individual words.
Lemmatization: Converting words to their base forms (e.g., “running” → “run”).
Removing Stop Words: Filtering out common words like “the,” “and,” and “is” that don’t contribute to sentiment.
Handling Special Characters and Emojis: Converting emojis into text representations to retain sentiment.

For these tasks, Python libraries such as TextBlob, NLTK, pandas, and NeatText were used. The goal was to create a dataset that accurately reflects human sentiment without irrelevant data points affecting the results. After cleaning, the resulting dataset consisted of 482,523 tweets.
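The cleaning steps above can be approximated with the standard library alone. The sketch below is a simplified stand-in, not the original pipeline: the stop-word set is a small illustrative sample, lemmatization is left out, and emojis are dropped rather than converted to text. The function names are hypothetical.

```python
import re

# Small illustrative stop-word set; a library list would be much longer.
STOP_WORDS = {"the", "and", "is", "a", "an", "of", "to", "in", "it", "for"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
NON_LETTER_RE = re.compile(r"[^a-z\s]")

def clean_tweet(text):
    """Lowercase a tweet, strip URLs/mentions/hashtags, and drop stop words."""
    text = text.lower()
    for pattern in (URL_RE, MENTION_RE, HASHTAG_RE):
        text = pattern.sub(" ", text)
    text = NON_LETTER_RE.sub(" ", text)  # removes digits, punctuation, emojis
    tokens = text.split()                # simple whitespace tokenization
    return [t for t in tokens if t not in STOP_WORDS]

def drop_duplicate_tweets(texts):
    """Keep the first tweet for each distinct cleaned form, preserving order."""
    seen, unique = set(), []
    for text in texts:
        key = " ".join(clean_tweet(text))
        if key and key not in seen:
            seen.add(key)
            unique.append(text)
    return unique

print(clean_tweet("The vaccine is SAFE! https://t.co/abc @who #Pfizer"))
```

Comparing tweets by their cleaned form means that copies differing only in case, punctuation, or a trailing link are treated as duplicates, which is one plausible way to catch repeated or automated posts.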

Sentiment Analysis Methodology #

1. Sentiment Classification #

Each tweet was classified into one of three sentiment categories:

Positive: Favorable opinions about Covid-19 vaccines.
Neutral: Informational or non-opinionated tweets.
Negative: Skepticism, misinformation, or distrust toward vaccines.

This classification was done using TextBlob, a Python library that assigns polarity scores to text:

Polarity ranges from -1 (negative) to +1 (positive).
A polarity score >0 is considered positive, <0 is negative, and 0 is neutral.

2. Subjectivity Analysis #

I also measured subjectivity, which TextBlob reports on a scale from 0 (objective) to 1 (subjective) and which indicates how factual or opinionated a tweet is. Subjectivity scores helped distinguish factual news reports from personal opinions, showing how much of the vaccine discourse was based on emotions rather than verifiable facts.

Sentiment Analysis with TextBlob #

This Python function leverages the TextBlob library to analyze the sentiment of a given text input. It returns a dictionary containing the polarity, subjectivity, and overall sentiment classification of the text.

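from textblob import TextBlob

def analyze_sentiment(text):
    """Return polarity, subjectivity, and a sentiment label for one text."""
    analysis = TextBlob(text)
    polarity = analysis.sentiment.polarity          # -1 (negative) to +1 (positive)
    subjectivity = analysis.sentiment.subjectivity  # 0 (objective) to 1 (subjective)

    # Any non-zero polarity is labeled positive or negative; exactly 0 is neutral.
    if polarity > 0:
        sentiment = 'Positive'
    elif polarity == 0:
        sentiment = 'Neutral'
    else:
        sentiment = 'Negative'

    return {
        'polarity': polarity,
        'subjectivity': subjectivity,
        'sentiment': sentiment
    }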

Key Findings #

1. Overall Sentiment Distribution #

The dataset showed the following sentiment distribution:

42.6% Positive
43.8% Neutral
13.6% Negative

Neutral tweets formed the largest group, though not a majority, and positive tweets outnumbered negative ones by roughly three to one (42.6% vs. 13.6%). This is an encouraging insight: despite vaccine hesitancy and misinformation, most tweets about vaccines were worded positively or neutrally.
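For reference, percentage shares like these can be computed from the per-tweet labels with a simple count. The sketch below uses a made-up list of eight labels, not the real dataset, and the function name is hypothetical.

```python
from collections import Counter

def sentiment_distribution(labels):
    """Return each label's share of all labels, in percent (1 decimal)."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: round(100 * n / total, 1) for label, n in counts.items()}

# Made-up labels for illustration only.
labels = ["Positive"] * 3 + ["Neutral"] * 4 + ["Negative"] * 1
shares = sentiment_distribution(labels)
print(shares)
```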

2. Vaccine-Specific Sentiment #

The sentiment scores for different vaccines were as follows:

Vaccine	Polarity	Subjectivity
Pfizer	0.1163	0.3176
AstraZeneca	0.1140	0.2685
Sputnik	0.1082	0.3041
Covaxin	0.1080	0.2541
Moderna	0.1047	0.2954

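Pfizer had the highest polarity (0.1163), suggesting the most favorable wording, although the gap between the highest and lowest scores is small (0.1163 vs. 0.1047).
Moderna had the lowest polarity but was still above 0, indicating positive sentiment overall.
Covaxin had the lowest subjectivity (0.2541), meaning tweets about it were worded more objectively.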

These results reflect how different vaccines were received by the public and provide insights into brand trust and perception.
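One way to produce a table like the one above is to assign each tweet to the vaccines it mentions and average the scores per vaccine. The post does not say how tweets were matched to vaccines, so the keyword map below is purely an assumption, as are the function name and the sample rows.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical keyword map; the actual matching rule is not documented.
VACCINE_KEYWORDS = {
    "Pfizer": ("pfizer", "biontech"),
    "Moderna": ("moderna",),
    "AstraZeneca": ("astrazeneca",),
    "Sputnik": ("sputnik",),
    "Covaxin": ("covaxin",),
}

def mean_scores_by_vaccine(rows):
    """rows: iterable of (text, polarity, subjectivity) tuples.

    Returns {vaccine: (mean_polarity, mean_subjectivity)}.
    A tweet naming several vaccines counts once for each of them.
    """
    buckets = defaultdict(list)
    for text, polarity, subjectivity in rows:
        lowered = text.lower()
        for vaccine, keywords in VACCINE_KEYWORDS.items():
            if any(k in lowered for k in keywords):
                buckets[vaccine].append((polarity, subjectivity))
    return {
        vaccine: (
            round(mean(p for p, _ in scores), 4),
            round(mean(s for _, s in scores), 4),
        )
        for vaccine, scores in buckets.items()
    }

# Made-up rows for illustration only.
rows = [
    ("Got my Pfizer shot", 0.5, 0.6),
    ("PfizerBioNTech works", 0.1, 0.2),
    ("Moderna today", 0.0, 0.0),
]
scores = mean_scores_by_vaccine(rows)
print(scores)
```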

3. Time-Series Analysis of Sentiment #

Analyzing sentiment over time revealed key trends:

Early 2020 had low tweet activity about vaccines due to the lack of available information.
Sentiment spiked in December 2020, aligning with the rollout of Pfizer’s vaccine under an Emergency Use Authorization (EUA).
The highest spike in sentiment occurred in August 2021, coinciding with the approval of the third dose in the U.S.
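A monthly view like this can be built by grouping tweets on the year and month of their timestamp. The sketch below assumes the `date` format shown in the dataset sample (`YYYY-MM-DD HH:MM:SS`) and reports both tweet volume and mean polarity per month. The function name and the records are illustrative, not the original analysis code.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

def monthly_mean_polarity(records):
    """records: iterable of (timestamp string, polarity) pairs.

    Returns {"YYYY-MM": (tweet_count, mean_polarity)} ordered by month.
    """
    by_month = defaultdict(list)
    for stamp, polarity in records:
        when = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
        by_month[when.strftime("%Y-%m")].append(polarity)
    return {
        month: (len(values), round(mean(values), 4))
        for month, values in sorted(by_month.items())
    }

# Made-up records for illustration only.
records = [
    ("2021-08-01 00:00:00", -0.1),
    ("2020-12-20 06:06:44", 0.2),
    ("2020-12-13 16:27:13", 0.4),
]
monthly = monthly_mean_polarity(records)
print(monthly)
```

Keeping the count next to the mean helps separate a spike in how much people tweeted from a shift in how they felt.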

4. Most Common Words in Sentiment Categories #

Using word clouds, we identified frequently used words in different sentiment categories:

Positive Words: #

Vaccine, Efficient, Thankful, Safe, Amazing, Voluntary

Negative Words: #

Dangerous, Scared, Misinformation, Side-effects, Risky

Neutral Words: #

Vaccine, Doses, Health, Available, Announcement
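A word cloud is essentially a picture of word frequencies, so the same lists can be approximated by counting tokens per sentiment category. The sketch below assumes tweets have already been cleaned and tokenized. The function name and sample rows are hypothetical.

```python
from collections import Counter, defaultdict

def top_words_by_sentiment(rows, n=5):
    """rows: iterable of (tokens, sentiment_label) pairs from cleaned tweets.

    Returns {label: [the n most frequent words for that label]}.
    """
    counters = defaultdict(Counter)
    for tokens, label in rows:
        counters[label].update(tokens)
    return {
        label: [word for word, _ in counter.most_common(n)]
        for label, counter in counters.items()
    }

# Made-up rows for illustration only.
rows = [
    (["vaccine", "safe"], "Positive"),
    (["safe", "thankful"], "Positive"),
    (["risky"], "Negative"),
]
top = top_words_by_sentiment(rows, n=1)
print(top)
```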

Challenges & Limitations #

While the analysis provided valuable insights, it also faced some limitations:

Bias in Twitter Data: The dataset may not represent the global population’s opinion.
Irony & Sarcasm Detection: Some tweets with sarcasm may have been misclassified.
Bot-Generated Tweets: Despite filtering, some automated tweets could have influenced results.

Conclusion & Takeaways #

This project provided a data-driven perspective on public sentiment toward Covid-19 vaccines, highlighting key trends and reactions. The main takeaways are:

Public sentiment was largely neutral to positive.
Pfizer had the most positive perception among vaccines.
Sentiment spiked during key vaccine approval milestones.

Understanding public opinion is crucial for public health campaigns, combating misinformation, and improving vaccine distribution strategies. Future improvements could include deep learning sentiment models and real-time analysis of vaccine perception.

Thank you for reading! If you have any questions or comments, please feel free to contact me. Your feedback is highly appreciated.

Keywords: NLP, Sentiment Analysis, Covid, Machine Learning, Data Science